31 research outputs found

    Improving Low Power Processor Efficiency with Static Pipelining

    Get PDF
    A new generation of mobile applications requires reduced energy consumption without sacrificing execution performance. In this paper, we propose to respond to these conflicting demands with an innovative statically pipelined processor supported by an optimizing compiler. The central idea of the approach is that the control during each cycle for each portion of the processor is explicitly represented in each instruction. Thus the pipelining is in effect statically determined by the compiler. The benefits of this approach include simpler hardware and that it allows the compiler to perform optimizations that are not possible on traditional architectures. The initial results indicate that static pipelining can significantly reduce power consumption without adversely affecting performance

    Instruction Re-Selection for Iterative Modulo Scheduling on High Performance Multi-Issue DSPs

    Get PDF
    An iterative modulo scheduling is very important for compilers targeting high performance multi-issue digital signal processors. This is because these processors are often severely limited by idle state functional units and thus the reduced idle units can have a positively significant impact on their performance. However, complex instructions, which are used in most recent DSPs such as mac, usually increase data dependence complexity, and such complex dependencies that exist in signal processing applications often restrict modulo scheduling freedom and therefore, become a limiting factor of the iterative modulo scheduler. In this work, we propose a technique that efficiently reselects instructions of an application loop code considering dependence complexity, which directly resolve the dependence constraint. That is specifically featured for accelerating software pipelining performance by minimizing length of intrinsic cyclic dependencies. To take advantage of this feature, few existing compilers support a loop unrolling based dependence relaxing technique, but only use them for some limited cases. This is mainly because the loop unrolling typically occurs an overhead of huge code size increment, and the iterative modulo scheduling with relaxed dependence techniques for general cases is an NP-hard problem that necessitates complex assignments of registers and functional units. Our technique uses a heuristic to efficiently handle this problem in pre-stage of iterative modulo scheduling without loop unrolling

    Effectively Exploiting Indirect Jumps

    No full text
    This paper describes a general code-improving transformation that can coalesce conditional branches into an indirect jump from a table. Applying this transformation allows an optimizer to exploit indirect jumps for many other coalescing opportunities besides the translation of multiway branch statements. First, dataflow analysis is performed to detect a set of coalescent conditional branches, which are often separated by blocks of intervening instructions. Second, several techniques are applied to reduce the cost of performing an indirect jump operation, often requiring the execution of only two instructions on a SPARC. Finally, the control flow is restructured using code duplication to replace the set of branches with an indirect jump. Thus, the transformation essentially provides early resolution of conditional branches that may originally have been some distance from the point where the indirect jump is inserted. The transformation can be frequently applied with often significant reductions in the number of instructions executed, total cache work, and execution time. In addition, we show that with branch target buffer support, indirect jumps improve branch prediction since they cause fewer mispredictions than the set of branches they replaced. Categories and Subject Descriptors: Programming Languages [Compilers]: Optimization---Lan

    Effectively exploiting indirect jumps

    Full text link

    Coalescing Conditional Branches into Efficient Indirect Jumps

    No full text
    Indirect jumps from tables are traditionally only generated by compilers as an intermediate code generation decision when translating multiway selection statements. However, making this decision during intermediate code generation poses problems. The research described in this paper resolves these problems by using several types of static analysis as a framework for a code improving transformation that exploits indirect jumps from tables. First, control-flow analysis is performed that provides opportunities for coalescing branches generated from other control statements besides multiway selection statements. Second, the optimizer uses various techniques to reduce the cost of indirect jump operations by statically analyzing the context of the surrounding code. Finally, path and branch prediction analysis is used to provide a more accurate estimation of the benefit of coalescing a detected set of branches into a single indirect jump. The results indicate that the coalescing transformation can be frequently applied with significant reductions in the number of instructions executed and total cache work. This paper shows that static analysis can be used to implement an effective improving transformation for exploiting indirect jumps

    Effectively Exploiting . . .

    No full text
    This paper describes a general code-improving transformation that can coalesce conditional branches into an indirect jump from a table. Applying this transformation allows an optimizer to exploit indirect jumps for many other coalescing opportunities besides the translation of multiway branch statements. First, data ow analysis is performed to detect a set of coalescent conditional branches, which are often separated by blocks of intervening instructions. Second, several techniques are applied to reduce the cost of performing an indirect jump operation, often requiring the execution of only two instructions on a SPARC. Finally, the control ow is restructured using code duplication to replace the set of branches with an indirect jump. Thus, the transformation essentially provides early resolution of conditional branches that may originally have been some distance from the point where the indirect jump is inserted. The transformation can be frequently applied with often signi cant reductions in the number of instructions executed, total cache work, and execution time. In addition, we show that with branch target bu er support, indirect jumps improve branch prediction since they cause fewer mispredictions than the set of branches they replaced

    Code Optimizations for a VLIW-Style Network Processing Unit

    No full text
    The explosive growth in network bandwidth and Internet services such as QoS (quality of service) and SLA (service level agreement) monitoring have created the need for new networking hardware called a Network Processing Unit (NPU). In order to rapidly reconfigure the NPU for frequently varying Internet services and technologies, a high-performance C compiler is urgently needed. Several code generation techniques, which are intended to meet the high code quality demands of other types of application specific instruction-set processors (ASIPs) like digital signal processors (DSPs), have already been developed. However, these techniques are insufficient for NPUs due to striking architectural differences such as asymmetric data paths. The main purpose of this paper is to discuss our recent experience with the development of a commercial compiler for a new NPU called the Paion PPII, which is basically a packet engine for NPU to meet the growing need for new high-bandwidth communication equipment targeted for Internet routers and ethernet adapters. For this purpose, we will first show the architectural challenges posed by the target NPU. Then, we will describe several compiler techniques that we found to be effective for the target NPU with various unorthogonal architectural features. The current implementations of the PPII use a VLIW (Very Long Instruction Word) architecture. So, we handled this VLIW-style architecture by employing a simple code compaction scheme which packs multiple parallel instructions into one long instruction word. The experimental results show that our techniques are effective for significantly reducing the dynamic instruction count

    Improving Performance by Branch Reordering

    No full text
    The conditional branch has long been considered an expensive operation. The relative cost of conditional branches has increased as recently designed machines are now relying on deeper pipelines and higher multiple issue. Reducing the number of conditional branches executed can often result in a substantial performance benefit. This paper describes a code-improving transformation to reorder sequences of conditional branches. First, sequences of branches that can be reordered are detected in the control flow. Second, profiling information is collected to predict the probability that each branch will transfer control out of the sequence. Third, the cost of performing each conditional branch is estimated. Fourth, the most beneficial ordering of the branches based on the estimated probability and cost is selected. The most beneficial ordering often included the insertion of additional conditional branches that did not previously exist in the sequence. Finally, the control flow isrestructured to reflect the new ordering. The results of applying the transformation were significant reductions in the dynamic number of instructions and branches, as well as decreases in execution time
    corecore